The frequentist interpretation of measurement results requires the specification of an ensemble of independent
replications of the same experiment. For complex calculations of bias, coverage, significance, etc., this ensemble 
is often simulated by running Monte Carlo pseudo-experiments. In order to be valid, the latter must obey 
the Frequentist Principle and the Anticipation Criterion. We formulate these two principles and describe some 
of their consequences in relation to stopping rules, conditioning, and nuisance parameters. The discussion is 
illustrated with examples taken from high-energy physics. 



1. Introduction 

Many statistical analyses in physics are based on a frequency interpretation of
probability. For example, the result of measuring a physical constant $\theta$ can be
reported in the form of a $1-\alpha$ confidence interval $[X_1, X_2]$, with the
understanding that if the measurement is replicated a large number of times, one will
have $X_1 < \theta < X_2$ in a fraction $1-\alpha$ of the replications. This type of
interpretation therefore requires the definition of a reference set of similar
measurements:

The reference set of a measurement is the 
ensemble of experiments in which the actu- 
ally performed experiment is considered to 
be embedded for the purpose of interpreting 
its results in a frequentist framework. 

A major appeal of frequentism among physicists is its 
empirical definition of probability. By the strong law 
of large numbers, probabilities can be approximated in 
finite ensembles, and such approximations converge to 
the true value as the ensemble size increases. In other 
words, frequentist confidence statements are experi- 
mentally verifiable. 

Physicists use Monte Carlo generated ensembles in 
various applications: to check a fitting algorithm for 
the presence of bias, non-Gaussian pulls, or other 
pathologies; to calculate the coverage of confidence in- 
tervals or upper limits; to average out statistical fluc- 
tuations in order to isolate systematic effects; to calcu- 
late goodness-of-fit measures and significances; to de- 
sign experiments; etc. When constructing ensembles 
to address these questions, one needs to pay attention 
to a number of subtle issues that arise in a frequentist 
framework: what is the correct stopping rule?; is it ap- 
propriate to condition, and if so, on what statistic?; 
how should nuisance parameters be handled? 

The aim of this paper is to draw attention to these issues and to propose some
recommendations where possible. We start by discussing basic frequentist principles in
section 2 and illustrate them with an example of conditioning in section 3. The
importance of stopping rules is argued in section 4. Finally, some purely frequentist
methods to handle nuisance parameters are described in section 5.



2. Frequentist Principles 

In order to deserve the label frequentist, a statisti- 
cal procedure and its associated ensemble must satisfy 
two core principles, which we examine in the next two 
subsections. 



2.1. The Frequentist Guarantee 

The first principle states the aims of frequentism: 

Frequentist Guarantee [1]:

In repeated use of a statistical procedure, the long-run average actual accuracy
should not be less than (and ideally should equal) the long-run average reported
accuracy.

To clarify this principle, we return to the $1-\alpha$ confidence interval procedure
mentioned in the Introduction. Let $\mathcal{E}$ be an ensemble of intervals obtained by
applying this procedure many times on different, independent data. The actual accuracy
of an interval in $\mathcal{E}$ is 1 or 0: either the interval covers the true value of
the parameter of interest, or it does not. The average actual accuracy is therefore
simply the fraction of intervals in $\mathcal{E}$ that cover. On the other hand, the
average reported accuracy is $1-\alpha$. The reported accuracy is often the same for all
intervals in $\mathcal{E}$, but in some settings it is possible to report a different,
data-dependent accuracy for each interval. Thus, averaging the reported accuracy is not
necessarily a trivial operation. A procedure that satisfies the Frequentist Guarantee is
said to have coverage.
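The comparison between average actual accuracy and reported accuracy can be checked on a toy ensemble. The sketch below is only an illustration: the Gaussian measurement model, the true value `theta`, the resolution `sigma`, and the function name are assumptions, not taken from the text.

```python
import random

def coverage_fraction(theta=3.0, sigma=1.0, n_rep=100_000, seed=12345):
    """Fraction of intervals [x - sigma, x + sigma] that contain theta.

    Each replication draws one Gaussian measurement x of the (assumed)
    true value theta and reports the nominal 68.27% interval x +/- sigma.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        x = rng.gauss(theta, sigma)          # one pseudo-experiment
        if x - sigma <= theta <= x + sigma:  # actual accuracy is 1 or 0
            hits += 1
    return hits / n_rep                      # average actual accuracy

if __name__ == "__main__":
    print(f"average actual accuracy: {coverage_fraction():.4f}")
    print("reported accuracy:       0.6827")
```

For this simple procedure the two averages agree within the statistical precision of the ensemble, which is what having coverage means.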

In a sense, the Frequentist Guarantee is only weakly constraining, because it does not
require a procedure to have coverage when applied to repeated measurements of the same
quantity. To see how this is relevant, consider the construction of a 68% confidence
interval for the mean $\mu$ of a Poisson distribution. One procedure is to take all
$\mu$ values satisfying $(n-\mu)^2/\mu \le 1$, where $n$ is the observed number of
events. The resulting interval actually undercovers for many values of $\mu$ and
overcovers for other values, so that the Frequentist Guarantee appears to be satisfied
on average. To make this statement more precise we need a weighting function with which
to carry out the average over $\mu$. A simple proposal is to perform local smoothing of
the coverage function, resulting in local average coverage.
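The oscillating coverage of this Poisson interval can be computed exactly. The sketch below evaluates the coverage at a given $\mu$ and a locally smoothed version of it. The uniform smoothing window and its half-width are assumptions made for illustration, since the text does not specify the weighting function.

```python
import math

def poisson_pmf(n, mu):
    """Poisson probability of observing n events when the mean is mu > 0."""
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def coverage(mu):
    """Probability that the set {mu' : (n - mu')^2 / mu' <= 1} contains mu."""
    n_max = int(mu + math.sqrt(mu)) + 1
    return sum(poisson_pmf(n, mu)
               for n in range(n_max + 1)
               if (n - mu) ** 2 <= mu)

def local_average_coverage(mu, half_window=0.5, n_points=101):
    """Coverage averaged over a uniform window around mu (assumed weighting)."""
    lo = max(mu - half_window, 1e-9)
    hi = mu + half_window
    step = (hi - lo) / (n_points - 1)
    return sum(coverage(lo + i * step) for i in range(n_points)) / n_points

if __name__ == "__main__":
    for mu in (1.0, 2.5, 4.3, 9.0):
        print(f"mu = {mu:4.1f}  coverage = {coverage(mu):.3f}  "
              f"local average = {local_average_coverage(mu):.3f}")
```

With these definitions the coverage is above 68% at $\mu = 1$ and below 68% at $\mu = 4.3$, illustrating the alternation between overcoverage and undercoverage.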

Physicists may object to this notion of local average 
coverage on the grounds that they sometimes repeatedly
measure a given constant of nature and are then
interested in the coverage obtained for that particular 
constant, not in an average coverage over "nearby" 
constants. A possible answer is that one rarely mea- 
sures the quantity of interest directly. Rather, one 
measures a combination of the quantity of interest 
with calibration constants, efficiencies, sample sizes, 
etc., all of which vary from one measurement to the 
next, so that an effective averaging does take place. 

Finally, it could be argued that even Bayesians 
should subscribe to some form of the Frequentist 
Guarantee. If, over repeated use, a 95% credible 
Bayesian interval fails to cover the true value more 
than 30% of the time (say), then there must be something
seriously wrong with that interval.

2.2. The Anticipation Criterion 

Although the Frequentist Guarantee specifies how a 
statistical procedure should behave under many rep- 
etitions of a measurement, it does not indicate what 
constitutes a valid repetition, and hence a valid en- 
semble. To the extent that this question involves the 
notion of randomness, it is well beyond the scope of 
this paper. From a practical standpoint however, one 
would like to stipulate that all effects susceptible to 
interfere with that randomness must be recognized as 
such and included in the construction of the ensemble, 
i.e. "anticipated". Hence the second principle:

Anticipation Criterion: 

Ensembles must anticipate all elements of 
chance and all elements of choice of the 
actual experiments they serve to interpret. 

To clarify, "elements of chance" refers to statistical 
fluctuations of course, but also to systematic uncer- 
tainties when the latter come from nuisance parame- 
ters that are determined by auxiliary measurements. 
On the other hand, "elements of choice" refers to ac- 
tions by experimenters, in particular how they decide 
to stop the experiment, and what decisions they make 
after stopping. 

One can identify several levels of anticipation. At 
the highest level, the data collection and analysis 
methods, as well as the reference ensemble used to interpret
results, are fully specified at the outset. They do not change
once the data is observed. The reference ensemble is called
"unconditional".

At the second highest level, the data collection and 
analysis methods are fully specified at the outset, but 
the reference ensemble is not. The latter will be fully 
determined once the data is observed, and is therefore
"conditional". Although a conditional ensemble is not
known before observing the data, it is a subset in a 
known partition of a known unconditional ensemble. 

The lowest level of anticipation is occupied by 
Bayesian methods, which fully condition on the ob- 
served data. The reference ensemble collapses to a 
point and can therefore no longer be used as a refer- 
ence. 

As the level of anticipation decreases, the reference 
ensemble becomes smaller. A remarkable result is that 
within the second level of anticipation one can refine 
the conditioning partition to the point where it is pos- 
sible to give a Bayesian interpretation to frequentist 
conclusions, and vice versa.

3. Conditioning 

To illustrate the interplay between anticipation and conditioning, we present here a
famous example originally due to Cox. Suppose we make one observation of a rare particle
and wish to estimate its mass $\mu$ from the momenta of its decay products. For the sake
of simplicity, assume that the estimator $X$ of $\mu$ is normal with mean $\mu$ and
variance $\sigma^2$. There is a 50% chance that the particle decays hadronically, in
which case $\sigma = 10$; otherwise the particle decays leptonically and $\sigma = 1$.
Consider the following 68% confidence interval procedures:

1. Unconditional

If the particle decayed hadronically, report $X \pm \delta_h$, otherwise report
$X \pm \delta_\ell$, where $\delta_h$ and $\delta_\ell$ are chosen so as to minimize the
expected length $\langle\delta\rangle = \delta_h + \delta_\ell$ subject to the
constraint of 68% coverage. This yields $\delta_\ell = 2.20$ and $\delta_h = 5.06$.
The expected length is 7.26.

2. Conditional

If we condition on the decay mode, then the best interval is $X \pm 10$ if the particle
decayed hadronically, and $X \pm 1$ otherwise. So the expected length is 11.0 in this
case.
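The unconditional half-widths can be reproduced numerically. The sketch below is one possible way to do it, not necessarily the one used to obtain the quoted numbers: it assumes a nominal coverage of exactly 0.68, uses the Lagrange stationarity condition $\varphi(\delta_h/\sigma_h)/\sigma_h = \varphi(\delta_\ell/\sigma_\ell)/\sigma_\ell$ (with $\varphi$ the standard normal density) to tie $\delta_h$ to $\delta_\ell$, and then solves the coverage constraint by bisection. All function names are illustrative.

```python
import math

SIGMA_H, SIGMA_L = 10.0, 1.0   # hadronic and leptonic resolutions from the text
TARGET = 0.68                  # nominal coverage (assumed to be exactly 0.68)

def central(z):
    """P(|Z| <= z) for a standard normal variable Z."""
    return math.erf(z / math.sqrt(2.0))

def delta_h_from(delta_l):
    """delta_h on the stationarity curve phi(dh/sh)/sh = phi(dl/sl)/sl."""
    arg = (delta_l / SIGMA_L) ** 2 - 2.0 * math.log(SIGMA_H / SIGMA_L)
    return SIGMA_H * math.sqrt(max(arg, 0.0))

def coverage(delta_l):
    """Unconditional coverage: average over the two equally likely decay modes."""
    delta_h = delta_h_from(delta_l)
    return 0.5 * central(delta_h / SIGMA_H) + 0.5 * central(delta_l / SIGMA_L)

def solve():
    """Bisection on delta_l; coverage() is non-decreasing in delta_l."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if coverage(mid) < TARGET:
            lo = mid
        else:
            hi = mid
    delta_l = 0.5 * (lo + hi)
    return delta_l, delta_h_from(delta_l)

if __name__ == "__main__":
    d_l, d_h = solve()
    print(f"delta_l = {d_l:.2f}, delta_h = {d_h:.2f}, expected length = {d_l + d_h:.2f}")
```

The solution agrees with the quoted $\delta_\ell = 2.20$ and $\delta_h = 5.06$ to within rounding. The hadronic interval is much narrower than $\pm 10$ because the leptonic mode supplies most of the coverage.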

Note that in both cases we used all the information 
available: the measurement X as well as the decay 
mode. Both procedures are valid; the only difference 
between them is the reference frame. The uncondi- 
tional ensemble includes both decay modes, whereas 
the conditional one only includes the observed decay 
mode. 

The expected length is shorter for unconditional in- 
tervals than for conditional ones. Does this mean we 
should quote the former? If our aim is to report what
we learned from the data we observed, then clearly we 
should report the conditional interval. Suppose indeed 
that we observed a hadronic decay. The unconditional 
interval width is then 10.12, compared to 20.0 for the 
conditional one. The reason the unconditional inter- 
val is shorter is that, if we could repeat the experi- 
ment, we might observe the particle decaying into the 
leptonic mode. However, this is irrelevant to the inter- 
pretation of the observation we actually made. This 
example illustrates a general feature of conditioning, 
that it usually increases expected length, and reduces 
power in test settings. 

Another aspect of the previous example is that the 
conditioning statistic (the decay mode) is ancillary: 
its distribution does not depend on the parameter 
of interest (the particle mass). This is not always 
the case. Suppose for example that we are given a sample from a normal distribution
with unit variance and unknown mean $\theta$, and that we wish to test
$H_0: \theta = -1$ versus $H_1: \theta = +1$. The standard symmetric Neyman-Pearson test
based on the sample mean $\bar{X}$ as test statistic rejects $H_0$ if $\bar{X} > 0$. It
makes no distinction between $\bar{X} = 0.5$ and $\bar{X} = 5$, even though in the
latter case we certainly feel more confident in our rejection of $H_0$. Although
$\bar{X}$ is not ancillary, it is possible to use it to calculate a conditional "measure
of confidence" to help characterize one's decision regarding $H_0$. Unfortunately, a
general theory for choosing such conditioning statistics does not exist.



4. Stopping Rules 

Stopping rules specify how an experiment is to be 
terminated. High-energy physics experiments are of- 
ten sequential, so it is important to properly incorpo- 
rate stopping rules in the construction of ensembles. 

As a first example, consider the measurement of the branching fraction $\theta$ for the
decay of a rare particle $A$ into a particle $B$. Suppose we observe a total of $n = 12$
decays, $x = 9$ of which are $A \to B$ transitions, and the rest, $r = 3$, are
$A \not\to B$ transitions. We wish to test $H_0: \theta = 1/2$ versus
$H_1: \theta > 1/2$.

A possible stopping rule is to stop the experiment 
after observing a total number of decays n. The prob- 
ability mass function (pmf) is then binomial: 

$$f(x;\theta) = \binom{n}{x}\,\theta^x (1-\theta)^{n-x}, \qquad (1)$$
and the p value for testing $H_0$ is:

$$p_b = \sum_{i=9}^{12} \binom{12}{i}\,\theta^i (1-\theta)^{12-i} = 0.073. \qquad (2)$$

An equally valid stopping rule is to stop the experiment after observing a number $r$
of $A \not\to B$ decays. Now the pmf is negative binomial:

$$f(x;\theta) = \binom{r+x-1}{x}\,\theta^x (1-\theta)^r, \qquad (3)$$
and the p value is:

$$p_{nb} = \sum_{i=9}^{\infty} \binom{2+i}{i}\,\theta^i (1-\theta)^3 = 0.0327. \qquad (4)$$

If we adopt a 5% threshold for accepting or rejecting $H_0$, we see that the binomial
model leads to acceptance, whereas the negative binomial model leads to rejection.
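The two tail probabilities can be evaluated directly from the binomial and negative binomial pmfs at $\theta = 1/2$. The sketch below does this with exact binomial coefficients; the function names are illustrative.

```python
from math import comb

def p_binomial(x_obs, n, theta):
    """P(X >= x_obs) when the total number of decays n is fixed in advance."""
    return sum(comb(n, i) * theta**i * (1 - theta)**(n - i)
               for i in range(x_obs, n + 1))

def p_negative_binomial(x_obs, r, theta):
    """P(X >= x_obs) when data taking stops at the r-th A-not-to-B decay."""
    below = sum(comb(r + i - 1, i) * theta**i * (1 - theta)**r
                for i in range(x_obs))
    return 1.0 - below   # complement of the finite sum avoids an infinite series

if __name__ == "__main__":
    print(f"binomial p value:          {p_binomial(9, 12, 0.5):.4f}")
    print(f"negative binomial p value: {p_negative_binomial(9, 3, 0.5):.4f}")
```

At $\theta = 1/2$ the two sums equal 299/4096 and 67/2048, so the same data fall on opposite sides of a 5% threshold depending on the stopping rule.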

Here is a more intriguing example [6]. Imagine a physicist working at some famous
particle accelerator and developing a procedure to select collision events that contain
a Higgs boson. Assume that the expected rate of background events accepted by this
procedure is known very accurately. Applying his technique to a given dataset, the
physicist observes 68 events and expects a background of 50. The (Poisson) probability
for 50 to fluctuate up to 68 or more is 0.89%, and the physicist concludes that there is
significant evidence against $H_0$, the background-only hypothesis, at the 1% level.

Is this conclusion correct? Perhaps the physicist 
just decided to take a single sample. But what would 
he have done if this sample had not yielded a signif- 
icant result? Perhaps he would have taken another 
sample! So the real procedure the physicist was con- 
sidering is actually of the form: 

• Take a data sample, count the number $n_1$ of Higgs candidates, and calculate the
expected background $b$;

• If $\mathbb{P}(N \ge n_1 \mid b) \le \alpha$ then stop and reject $H_0$;

• Otherwise, take a second sample with the same expected background, count the number
$n_2$ of Higgs candidates and reject $H_0$ if
$\mathbb{P}(N \ge n_1 + n_2 \mid 2b) \le \alpha$.

For this test procedure to have a level of 1%, $\alpha$ must be set at 0.67%. Since the
actual data had a p value of 0.89%, the physicist should not have rejected $H_0$.
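The overall level of such a two-stage procedure can be computed exactly for Poisson counts. The sketch below sums the probability of rejecting at the first look and the probability of rejecting only at the second look. The function names are illustrative, and because counts are discrete the achieved level is a step function of $\alpha$, so this sketch is not guaranteed to reproduce the quoted 0.67% threshold digit for digit.

```python
import math

def pois_pmf(k, mu):
    """Poisson probability of k events for mean mu > 0."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def pois_sf(n, mu):
    """P(N >= n) for a Poisson variable with mean mu."""
    return 1.0 - sum(pois_pmf(k, mu) for k in range(n))

def critical_count(mu, alpha):
    """Smallest n such that P(N >= n | mu) <= alpha."""
    n = 0
    while pois_sf(n, mu) > alpha:
        n += 1
    return n

def two_stage_level(b, alpha):
    """P(reject H0) under H0 for the two-look procedure described above."""
    c1 = critical_count(b, alpha)        # rejection threshold at first look
    c2 = critical_count(2.0 * b, alpha)  # threshold on n1 + n2 at second look
    level = pois_sf(c1, b)               # reject at the first look
    for n1 in range(c1):                 # continue, then reject at the second
        level += pois_pmf(n1, b) * pois_sf(max(c2 - n1, 0), b)
    return level

if __name__ == "__main__":
    print(f"P(N >= 68 | 50)   = {pois_sf(68, 50.0):.4f}")
    print(f"P(N >= 125 | 100) = {pois_sf(125, 100.0):.4f}")
    print(f"two-stage level at alpha = 0.67%: {two_stage_level(50.0, 0.0067):.4f}")
```

By construction the two-stage level lies between the first-look rejection probability and $2\alpha$, which is why $\alpha$ has to be chosen below the desired overall level.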

So now the physicist finds himself forced to take 
another sample. There are two interesting cases: 

1. The second sample yields 57 candidate events, 
for a total of 125. The probability for the ex- 
pected background (100 events now) to fluctu- 
ate up to 125 or more is 0.88% > 0.67%, so 
the result is not significant. However, the re- 
sult would have been significant if the physicist 
had not stopped halfway through data taking to 
calculate the p value! 
2. The second sample yields 59 candidate events,
for a total of 127. The p value is now 0.52% and 
significance has been obtained, unless of course 
the physicist was planning to take a third sample 
in the event of no significance. 

Bayesian methods are generally independent of the 
stopping rule. It is therefore somewhat ironic that 
frequentists, who start from an objective definition of
probability, should end up with results that depend 
on the thought processes of the experimenter. 

5. Nuisance Parameters 

Most problems of inference involve nuisance parameters, i.e. uninteresting parameters
that are incompletely known and therefore add to the overall uncertainty on the
parameters of interest. To fix ideas, assume that we have a sample
$\{x_1, \ldots, x_n\}$ whose probability density function (pdf) $f(x; \mu, \nu)$ depends
on a parameter of interest $\mu$ and a nuisance parameter $\nu$, and that the latter can
be determined from a separate sample $\{y_1, \ldots, y_m\}$ with pdf $g(y; \nu)$.
Correct inference about $\mu$ must then be derived from the joint pdf

$$h(x, y; \mu, \nu) \equiv f(x; \mu, \nu)\, g(y; \nu). \qquad (5)$$

What is often done in practice however, is to first obtain a distribution $\pi(\nu)$
for $\nu$, usually by combining measurement results with a sensible guess for the form
of $\pi(\nu)$. Inference about $\mu$ is then based on:

$$h'(x; \mu) \equiv \int f(x; \mu, \nu)\, \pi(\nu)\, d\nu. \qquad (6)$$

Although this technique borrows elements from both Bayesian and frequentist
methodologies, it really belongs to neither and is more properly referred to as a
hybrid non-frequentist/non-Bayesian approach.

We illustrate the handling of nuisance parameters with a simple p value calculation.
Suppose that a search for a new particle ends with a sample of $n_0 = 12$ candidates
over a separately measured background of $\nu_0 = 5.7 \pm 0.47$, where we ignore the
uncertainty on the standard error 0.47. Let $\mu$ be the unknown expected number of new
particles among the 12 candidates. We wish to test $H_0: \mu = 0$ versus
$H_1: \mu > 0$.

A typical model for this problem consists of a Poisson density for the number of
observed candidates and a Gaussian for the background measurement. Using equation (6)
with a simple Monte Carlo integration routine, one obtains a p value of $\sim 1.6\%$.
For reference, when there is no uncertainty on $\nu_0$ the p value is $\sim 1.4\%$.
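A Monte Carlo integration of this kind might look as follows. This is a minimal sketch rather than the routine actually used: it assumes $\pi(\nu)$ is a Gaussian centred on 5.7 with width 0.47, draws $\nu$ from it, and averages the exact Poisson tail probability over the draws. The function names, sample size, and seed are illustrative.

```python
import math
import random

def poisson_tail(n, mean):
    """P(N >= n) for a Poisson variable with the given mean."""
    if mean <= 0.0:
        return 1.0 if n <= 0 else 0.0
    term, cdf = math.exp(-mean), 0.0
    for k in range(n):
        cdf += term              # accumulate P(N = k) for k < n
        term *= mean / (k + 1)
    return 1.0 - cdf

def hybrid_p_value(n_obs, nu0, dnu, n_samples=20_000, seed=1):
    """Average the Poisson tail over nu ~ Normal(nu0, dnu), as in equation (6)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        nu = rng.gauss(nu0, dnu)           # "smeared" background expectation
        total += poisson_tail(n_obs, nu)
    return total / n_samples

if __name__ == "__main__":
    print(f"no background uncertainty: {poisson_tail(12, 5.7):.4f}")
    print(f"with Gaussian smearing:    {hybrid_p_value(12, 5.7, 0.47):.4f}")
```

Averaging the exact tail probability, instead of also generating Poisson counts, keeps the statistical noise of the integration small.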

While there are many purely frequentist approaches 
to the elimination of nuisance parameters, few of these 
have general applicability. Concentrating on the lat- 
ter, we discuss the likelihood ratio and confidence in- 
terval methods in the next two subsections. 



5.1. Likelihood Ratio Method 

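The likelihood ratio statistic $\lambda$ is defined by:

$$\lambda = \frac{\sup_{\nu \ge 0} \mathcal{L}(\mu, \nu \mid n_0, \nu_0)\big|_{\mu=0}}{\sup_{\mu \ge 0,\, \nu \ge 0} \mathcal{L}(\mu, \nu \mid n_0, \nu_0)}, \qquad (7)$$

where, for $\nu_0 \gg \Delta\nu$:

$$\mathcal{L}(\mu, \nu \mid n_0, \nu_0) \propto \frac{(\mu+\nu)^{n_0}}{n_0!}\, e^{-\mu-\nu}\, e^{-\frac{1}{2}\left(\frac{\nu-\nu_0}{\Delta\nu}\right)^2}.$$

Simple calculus leads to:

$$-2\ln\lambda = \begin{cases} 2\left(n_0 \ln\dfrac{n_0}{\hat\nu} + \hat\nu - n_0\right) + \left(\dfrac{\hat\nu-\nu_0}{\Delta\nu}\right)^2 & \text{if } n_0 > \nu_0, \\ 0 & \text{if } n_0 \le \nu_0, \end{cases}$$

with: $$\hat\nu = \frac{\nu_0 - \Delta\nu^2}{2} + \sqrt{\left(\frac{\nu_0 - \Delta\nu^2}{2}\right)^2 + n_0\,\Delta\nu^2}.$$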
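The calculus step can be spelled out as follows. This is a sketch under the same assumption $\nu_0 \gg \Delta\nu$, for the case $n_0 > \nu_0$ in which the unconstrained maximum lies inside the physical region; the numerical check uses the example values $n_0 = 12$, $\nu_0 = 5.7$, $\Delta\nu = 0.47$.

```latex
\begin{align*}
% Numerator of (7): set \mu = 0 and maximize over \nu
\ln\mathcal{L}(0,\nu) &= n_0\ln\nu - \nu - \frac{(\nu-\nu_0)^2}{2\,\Delta\nu^2} + \text{const}, \\
\frac{\partial \ln\mathcal{L}(0,\nu)}{\partial\nu} = 0
  &\;\Longrightarrow\; \nu^2 - (\nu_0 - \Delta\nu^2)\,\nu - n_0\,\Delta\nu^2 = 0, \\
% The positive root of the quadratic is \hat\nu
\hat\nu &= \frac{\nu_0-\Delta\nu^2}{2}
          + \sqrt{\left(\frac{\nu_0-\Delta\nu^2}{2}\right)^2 + n_0\,\Delta\nu^2}. \\
% Denominator of (7): the maximum is at \mu + \nu = n_0, \nu = \nu_0
\ln\mathcal{L}_{\max} &= n_0\ln n_0 - n_0 + \text{const}, \\
-2\ln\lambda &= 2\left(n_0\ln\frac{n_0}{\hat\nu} + \hat\nu - n_0\right)
              + \left(\frac{\hat\nu-\nu_0}{\Delta\nu}\right)^2. \\
% Numerical check with the example values
\hat\nu &\approx 5.926, \qquad -2\ln\lambda \approx 4.785 + 0.232 \approx 5.02.
\end{align*}
```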

Since $\lambda$ depends on $n_0$ and $\nu_0$, its distribution under $H_0$ depends on
the true expected background $\nu_t$. A natural simplification is to examine the limit
$\nu_t \to \infty$. Application of theorems describing the asymptotic behavior of
$-2\ln\lambda$ must take into account that for $n_0 < \nu_0$ the analytical maximum of
the likelihood lies outside the physical region $\mu \ge 0$. The correct asymptotic
result is that, under $H_0$, half a unit of probability is carried by the singleton
$\{-2\ln\lambda = 0\}$, and the other half is distributed as a chi-squared with one
degree of freedom over $0 < -2\ln\lambda < +\infty$.

For our example the expected background is only 5.7 particles however, so one may
wonder how close this is to the asymptotic limit. Here is an algorithm to check this.
Choose a true number of background events $\nu_t$ and repeat the following three steps a
large number of times:

1. Generate a Gaussian variate $\nu_0$ with mean $\nu_t$ and width $\Delta\nu$;

2. Generate a Poisson variate $n_0$ with mean $\nu_t$;

3. Calculate $\lambda$ from the generated $\nu_0$ and $n_0$.

The p value is then equal to the fraction of pseudo-experiments that yield a likelihood
ratio $\lambda$ smaller than the $\lambda_0$ obtained from the observed data.
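The three steps can be sketched as a short pseudo-experiment loop. The code below is an illustration under stated assumptions: it uses the closed-form $-2\ln\lambda$ valid for $\nu_0 \gg \Delta\nu$ (so it is only meaningful when the chosen true background is well above $\Delta\nu$), counts ties with the observed value as at least as extreme, and uses Knuth's multiplication method for Poisson variates. Function names, the number of trials, and the seed are illustrative.

```python
import math
import random

def minus_two_ln_lambda(n0, nu0, dnu):
    """-2 ln(lambda) from the closed form, assuming nu0 >> dnu."""
    if n0 <= nu0 or n0 == 0:
        return 0.0
    half = 0.5 * (nu0 - dnu**2)
    nu_hat = half + math.sqrt(half**2 + n0 * dnu**2)
    return (2.0 * (n0 * math.log(n0 / nu_hat) + nu_hat - n0)
            + ((nu_hat - nu0) / dnu) ** 2)

def poisson_variate(mean, rng):
    """Knuth's multiplication method; adequate for moderate means."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def lr_p_value(n_obs, nu_obs, dnu, nu_true, n_trials=20_000, seed=2):
    """Fraction of pseudo-experiments with lambda <= lambda_0 under H0."""
    rng = random.Random(seed)
    q_obs = minus_two_ln_lambda(n_obs, nu_obs, dnu)
    extreme = 0
    for _ in range(n_trials):
        nu0 = rng.gauss(nu_true, dnu)        # step 1: background measurement
        n0 = poisson_variate(nu_true, rng)   # step 2: candidate count (mu = 0)
        if minus_two_ln_lambda(n0, nu0, dnu) >= q_obs:   # step 3
            extreme += 1
    return extreme / n_trials

if __name__ == "__main__":
    for nu_true in (5.7, 20.0, 50.0):
        print(f"nu_t = {nu_true:5.1f}  p = {lr_p_value(12, 5.7, 0.47, nu_true):.4f}")
```

A small $\lambda$ corresponds to a large $-2\ln\lambda$, which is why the loop compares $-2\ln\lambda$ values with `>=`.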

Note that this algorithm does not "smear" the true value of any parameter, in contrast
with equation (6). The price for this is that the result depends on the choice of
$\nu_t$. For $\nu_t$ varying from 0.5 to 50, the p value ranges from $\sim 0.48\%$ to
$\sim 1.2\%$. A general prescription for dealing with a p value dependence on nuisance
parameters is to use the so-called supremum p value:

$$p_{\sup} = \sup_{\nu} \mathbb{P}(-2\ln\lambda \ge -2\ln\lambda_0 \mid \mu, \nu)\big|_{\mu=0}.$$
From a frequentist point of view, the supremum p value is valid, in the sense that:

$$\mathbb{P}(p_{\sup} \le \alpha) \le \alpha, \quad \text{for each } \alpha \in [0, 1], \qquad (8)$$

regardless of the true value of the nuisance parameter. Although it is often difficult
to calculate a supremum, in this case it turns out to equal the asymptotic limit to a
good approximation. In our example $-2\ln\lambda_0 = 5.02$ and corresponds to
$p_{\sup} \approx 1.25\%$.
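As a worked check of the asymptotic value, half of the probability sits at $-2\ln\lambda = 0$ and the other half follows a chi-squared distribution with one degree of freedom, so the tail probability for the observed 5.02 is half the chi-squared tail. Here $\Phi$ denotes the standard normal cumulative distribution function, a symbol introduced only for this check.

```latex
\begin{align*}
p_{\sup} &\approx \tfrac{1}{2}\,\mathbb{P}\!\left(\chi^2_1 \ge 5.02\right)
          = \tfrac{1}{2}\,\mathrm{erfc}\!\left(\sqrt{5.02/2}\right) \\
         &= 1 - \Phi\!\left(\sqrt{5.02}\right)
          \approx 1 - \Phi(2.24) \approx 1.25\%.
\end{align*}
```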

As the attentive reader will have noticed, the p value is smaller for
$\Delta\nu = 0.47$ than for $\Delta\nu = 0$. This is a consequence of the discreteness
of Poisson statistics; it does not violate inequality (8) because $p_{\sup}$ actually
overcovers a little when $\Delta\nu = 0$. To avoid the bias resulting from this
overcoverage, the use of mid-p values is sometimes advocated for the purpose of
comparing or combining p values.

5.2. Confidence Interval Method 

The supremum p value introduced in the previous section can be defined for any test
statistic, although it will not always give useful results. If for example in our new
particle search we take the total number $n_0$ of observed candidates as test statistic,
the p value will be 100% since the background $\nu$ is unbounded from above. A more
satisfactory method proceeds as follows. First, construct a $1-\beta$ confidence
interval $C_\beta$ for the nuisance parameter $\nu$, then maximize the p value over that
interval, and finally correct the result for the fact that $\beta \ne 0$:

$$p_\beta = \sup_{\nu \in C_\beta} \mathbb{P}(N \ge n_0 \mid \mu, \nu)\big|_{\mu=0} + \beta.$$

It can be shown that this is also a valid p value.

For the sake of illustration with our example, we consider three choices of $\beta$
and construct the corresponding $1-\beta$ confidence intervals for $\nu_t$:

$1-\beta = 99.5\%$:   $C_{0.005} = [4.38, 7.02]$
$1-\beta = 99.9\%$:   $C_{0.001} = [4.15, 7.25]$
$1-\beta = 99.99\%$:  $C_{0.0001} = [3.87, 7.53]$


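To calculate the p value, a good choice of statistic is the maximum likelihood
estimator of the signal, i.e. $\hat{s} = n_0 - \nu_0$. Under $H_0$, the survivor
function of $\hat{s}$ is given by:

$$\mathbb{P}(\hat{S} \ge s) = \sum_{k=0}^{\infty} \frac{1 + \mathrm{erf}\left(\dfrac{k - \nu_t - s}{\sqrt{2}\,\Delta\nu}\right)}{1 + \mathrm{erf}\left(\dfrac{\nu_t}{\sqrt{2}\,\Delta\nu}\right)}\; \frac{\nu_t^k\, e^{-\nu_t}}{k!}.$$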

We then find:

$1-\beta = 99.5\%$:   $p_\beta = 1.6\% + 0.5\% = 2.1\%$
$1-\beta = 99.9\%$:   $p_\beta = 1.7\% + 0.1\% = 1.8\%$
$1-\beta = 99.99\%$:  $p_\beta = 1.88\% + 0.01\% = 1.89\%$
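The whole confidence interval method can be sketched in a few lines. The code below is an illustration under stated assumptions: the interval $C_\beta$ is taken to be the central Gaussian interval around 5.7, the Gaussian for the background measurement is not truncated at zero (harmless here because $\nu_t \gg \Delta\nu$ throughout $C_\beta$), and the supremum is approximated by a scan over a uniform grid. Function names and grid size are illustrative.

```python
import math
from statistics import NormalDist

N_OBS, NU_OBS, DNU = 12, 5.7, 0.47   # values quoted in the text

def survivor(s, nu_t, dnu=DNU):
    """P(N - nu0 >= s) for N ~ Poisson(nu_t) and nu0 ~ Normal(nu_t, dnu)."""
    total, pmf = 0.0, math.exp(-nu_t)
    k_max = int(nu_t + 12.0 * math.sqrt(nu_t) + 30)   # Poisson tail cut-off
    for k in range(k_max + 1):
        z = (k - nu_t - s) / (math.sqrt(2.0) * dnu)
        total += pmf * 0.5 * (1.0 + math.erf(z))      # P(nu0 <= k - s)
        pmf *= nu_t / (k + 1)
    return total

def interval(beta):
    """Central 1 - beta Gaussian confidence interval for the background."""
    z = NormalDist().inv_cdf(1.0 - beta / 2.0)
    return NU_OBS - z * DNU, NU_OBS + z * DNU

def p_beta(beta, n_grid=200):
    """Supremum of the p value over C_beta (grid scan), plus beta."""
    lo, hi = interval(beta)
    s_obs = N_OBS - NU_OBS
    grid = (lo + (hi - lo) * i / n_grid for i in range(n_grid + 1))
    return max(survivor(s_obs, nu) for nu in grid) + beta

if __name__ == "__main__":
    for beta in (0.005, 0.001, 0.0001):
        lo, hi = interval(beta)
        print(f"beta = {beta:<7} C = [{lo:.2f}, {hi:.2f}]  p_beta = {p_beta(beta):.4f}")
```

In this example the maximum over $C_\beta$ is reached at the upper end of the interval, since a larger true background makes an upward fluctuation more likely.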



An important point about the confidence interval method is that, in order to satisfy
the Anticipation Criterion, the value of $\beta$ and the confidence set $C_\beta$ must
be specified before looking at the data. Since $p_\beta$ is never smaller than $\beta$,
the latter should be small. In particular, if $p_\beta$ is used in a level-$\alpha$
test, then $\beta$ must be smaller than $\alpha$ for the test to be useful.



6. Summary 

From the practical point of view of someone ana- 
lyzing data, the most critical property of frequentist 
ensembles is their "anticipatoriness." This requires 
that all the structural elements of an analysis (i.e. 
test sizes, interval procedures, bin boundaries, stop- 
ping rules, etc.) be in place before looking at the 
data. The only exception to this requirement occurs 
in situations where conditioning is both possible and 
appropriate. Even in that case, the conditioning par- 
tition itself must be specified beforehand. 